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O Abstract 

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art 
performance on a variety of machine learning tasks. Several researchers have recently pro- 
" ' posed schemes to parallelize SGD, but all require performance-destroying memory locking and 

^ synchronization. This work aims to show using novel theoretical analysis, algorithms, and im- 

plementation that SGD can be implemented without any locking. We present an update scheme 
called Hogwild! which allows processors access to shared memory with the possibility of over- 
writing each other's work. We show that when the associated optimization problem is sparse, 
meaning most gradient updates only modify small parts of the decision variable, then Hogwild! 
d achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild! 

^ outperforms alternative schemes that use locking by an order of magnitude. 

Keywords. Incremental gradient methods, Machine learning. Parallel computing, Multicore 
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^ 1 Introduction 

With its small memory footprint, robustness against noise, and rapid learning rates. Stochastic Gra- 



X 



\C> dient Descent (SGD) has proved to be well suited to data-intensive machine learning tasks [3|[5,26 

However, SGD's scalability is limited by its inherently sequential nature; it is difficult to par- 
allelize. Nevertheless, the recent emergence of inexpensive multicore processors and mammoth, 
web-scale data sets has motivated researchers to develop several clever parallelization schemes for 
SGD [4||To||T2|[l6|[30|. As many large data sets are currently pre-processed in a MapReduce-like 
parallel-processing framework, much of the recent work on parallel SGD has focused naturally on 
MapReduce implementations. MapReduce is a powerful tool developed at Google for extracting 
information from huge logs (e.g., "find all the urls from a 100TB of Web data") that was designed 
to ensure fault tolerance and to simplify the maintenance and programming of large clusters of 
machines [9|. But MapReduce is not ideally suited for online, numerically intensive data analy- 
sis. Iterative computation is difficult to express in MapReduce, and the overhead to ensure fault 
tolerance can result in dismal throughput. Indeed, even Google researchers themselves suggest 
that other systems, for example Dremel, are more appropriate than MapReduce for data analysis 
tasks [21]. 

For some data sets, the sheer size of the data dictates that one use a cluster of machines. How- 
ever, there are a host of problems in which, after appropriate preprocessing, the data necessary for 
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statistical analysis may consist of a few terabytes or less. For such problems, one can use a single 
inexpensive work station as opposed to a hundred thousand dollar cluster. Multicore systems have 
significant performance advantages, including (1) low latency and high throughput shared main 
memory (a processor in such a system can write and read the shared physical memory at over 
12GB/s with latency in the tens of nanoseconds); and (2) high bandwidth off multiple disks (a 
thousand-dollar RAID can pump data into main memory at over IGB/s). In contrast, a typical 
MapReduce setup will read incoming data at rates less than tens of MB / s due to frequent check- 
pointing for fault tolerance. The high rates achievable by multicore systems move the bottlenecks 
in parallel computation to synchronization (or locking) amongst the processors [2 13 . Thus, to 



enable scalable data analysis on a multicore machine, any performant solution must minimize the 
overhead of locking. 

In this work, we propose a simple strategy for eliminating the overhead associated with locking: 
run SGD in parallel without locks, a strategy that we call Hogwild!. In Hogwild!, processors are 
allowed equal access to shared memory and are able to update individual components of memory 
at will. Such a lock-free scheme might appear doomed to fail as processors could overwrite each 
other's progress. However, when the data access is sparse, meaning that individual SGD steps only 
modify a small part of the decision variable, we show that memory overwrites are rare and that 
they introduce barely any error into the computation when they do occur. We demonstrate both 
theoretically and experimentally a near linear speedup with the number of processors on commonly 
occurring sparse learning problems. 

In Section [2j we formalize a notion of sparsity that is sufficient to guarantee such a speedup 
and provide canonical examples of sparse machine learning problems in classification, collaborative 
filtering, and graph cuts. Our notion of sparsity allows us to provide theoretical guarantees of 
linear speedups in Section |4j As a by-product of our analysis, we also derive rates of convergence 
for algorithms with constant stepsizes. We demonstrate that robust \/k convergence rates are 
possible with constant stepsize schemes that implement an exponential back-off in the constant 
over time. This result is interesting in of itself and shows that one need not settle for l/\/~k rates 
to ensure robustness in SGD algorithms. 

In practice, we find that computational performance of a lock-free procedure exceeds even our 
theoretical guarantees. We experimentally compare lock-free SGD to several recently proposed 
methods. We show that all methods that propose memory locking are significantly slower than 
their respective lock-free counterparts on a variety of machine learning applications. 



2 Sparse Separable Cost Functions 

Our goal throughout is to minimize a function f : X CI M" — M of the form 

f{x) = ^fe{Xe). (2.1) 

Here e denotes a small subset of {l,...,n} and Xg denotes the values of the vector x on the 
coordinates indexed by e. The key observation that underlies our lock-free approach is that the 
natural cost functions associated with many machine learning problems of interest are sparse in the 
sense that \E\ and n are both very large but each individual fe acts only on a very small number 
of components of x. That is, each subvector Xg contains just a few components of x. 
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Figure 1: Example graphs induced by cost function, (a) A sparse SVM induces a hypergraph 
where each hyperedge corresponds to one example, (b) A matrix completion example induces 
a bipartite graph between the rows and columns with an edge between two nodes if an entry is 
revealed, (c) The induced hypergraph in a graph-cut problem is simply the graph whose cuts 
we aim to find. 



The cost function (2.1) induces a hypergraph G = iy^E) whose nodes are the individual com- 



ponents of X. Each subvector Xe induces an edge in the graph e G £" consisting of some subset of 
nodes. A few examples illustrate this concept. 



Sparse SVM. Suppose our goal is to fit a support vector machine to some data pairs E 
{(zi,yi), . . . , [z\E\-,y\E\)} where z £ M" and y is a label for each {z,y) £ E. 

minimizea^ ^ ^ max(l — y^y^x^ Zq^^ 0) + A||a;||2 , 
and we know a priori that the examples Za are very sparse (see for example [14]). To write this 



(2.2) 



cost function in the form of (2.1), let Ca denote the components which are non-zero in Za and let 



Then we can rewrite (2.2) as 



du denote the number of training examples which are non-zero in component u {u = 1, 2, . . . , n). 

(2.3) 



minimize^ ( max(l — yaX^ Za,Q) + A | . 



Each term in the sum (2.3) depends only on the components of x indexed by the set Cq. 



Matrix Completion. In the matrix completion problem, we are provided entries of a low-rank, 
Ur X Uc matrix Z from the index set E. Such problems arise in collaborative filtering, Euclidean 
distance estimation, and clustering 8) |17p4 . Our goal is to reconstruct Z from this sparse sampling 
of data. A popular heuristic recovers the estimate of Z as a product LR* of factors obtained from 
the following minimization: 



minimize(£,^i^) ^ 



{LuRy — Zuv)"^ + 2 II-^IIf ~^ 2 11-^" 



F > 



(2.4) 



where L is x r, i? is nc x r and (resp. R^) denotes the tith (resp. fth) row of L (resp. R) 

} 



17,24,27 . To put this problem in sparse form, i.e., as (|2.1|), we write (|2.4|) as 
minimize( 



(L,R) X] ^{LuRl - Zuv)^ + 



2\Eu-\ 



\Lu\\'f + 



2|E_„| 



where E^. 



{u,v)eE 

{v : {u,v) E E} and E^^ = {u : {u,v) G E}. 
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Graph Cuts. Problems involving minimum cuts in graphs frequently arise in machine learning 
(see |6| for a comprehensive survey). In such problems, we are given a sparse, nonnegative matrix W 
which indexes similarity between entities. Our goal is to find a partition of the index set {1, . . . , n} 
that best conforms to this similarity matrix. Here the graph structure is explicitly determined by 
the similarity matrix W; arcs correspond to nonzero entries in W. We want to match each string 
to some list of D entities. Each node is associated with a vector Xi in the D-dimensional simplex 
Sd = {C £ '■ Cv ^ Ylv=i Cv = !}• Here, two-way cuts use D = 2, but multiway-cuts with 
tens of thousands of classes also arise in entity resolution problems |18| . For example, we may have 
a list of n strings, and Wuv might index the similarity of each string. Several authors (e.g., 
propose to minimize the cost function 

minimize^; — subject to Xy ^ Sd iov v = 1, . . . ,n . (2-5) 

In all three of the preceding examples, the number of components involved in a particular term 
/e is a small fraction of the total number of entries. We formalize this notion by defining the 
following statistics of the hypergraph G: 

I I A maxi<„<„ |{e G S : u G e}| maXees |{e G -E ■.ene^^}\ 
Si := max e , A := — , p := — . (^-o) 

eeE \E\ \E\ 

The quantity SI simply quantifies the size of the hyper edges, p determines the maximum fraction of 
edges that intersect any given edge. A determines the maximum fraction of edges that intersect any 
variable, p is a measure of the sparsity of the hypergraph, while A measures the node-regularity. 
For our examples, we can make the following observations about p and A. 

1. Sparse SVM. A is simply the maximum frequency that any feature appears in an example, 
while p measures how clustered the hypergraph is. If some features are very common across 
the data set, then p will be close to one. 

2. Matrix Completion. If we assume that the provided examples are sampled uniformly at 
random and we see more than nclog(nc) of them, then A ~ ^nd p ~ 2iog(n,.) ^ fj^j^^g 
follows from a coupon collector argument [s]. 

3. Graph Cuts. A is the maximum degree divided by \E\, and p is at most 2A. 

We now describe a simple protocol that achieves a linear speedup in the number of processors 
when fi. A, and p are relatively small. 

3 The Hogwild! Algorithm 

Here we discuss the parallel processing setup. We assume a shared memory model with p proces- 
sors. The decision variable x is accessible to all processors. Each processor can read x, and can 
contribute an update vector to x. The vector x is stored in shared memory, and we assume that 
the componentwise addition operation is atomic, that is 
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Algorithm 1 Hogwild! update for individual processors 

1: loop 

2: Sample e uniformly at random from E 
3: Read current state Xg and evaluate Ge{x) 
4: for V ^ e do <— Xy — 'jb'^Ge{x) 
5: end loop 



can be performed atomically by any processor for a scalar a and v G {1, . . . ,n}. This operation 
does not require a separate locking structure on most modern hardware: such an operation is 
a single atomic instruction on GPUs and DSPs, and it can be implemented via a compare-and- 
exchange operation on a general purpose multicore processor like the Intel Nehalem. In contrast, 
the operation of updating many components at once requires an auxiliary locking structure. 

Each processor then follows the procedure in Algorithm [TJ To fully describe the algorithm, let 
bv denote one of the standard basis elements in M", with v ranging from 1,. . . ,n. That is, by is 
equal to 1 on the vth. component and otherwise. Let Vy denote the Euclidean projection matrix 
onto the vth. coordinate, i.e., Vv = b^bj^. Vv is a diagonal matrix equal to 1 on the vth. diagonal and 
zeros elsewhere. Let Ge{x) E denote a gradient or subgradient of the function /e multiplied by 
\E\. That is, we extend fe from a function on the coordinates of e to all of M" simply by ignoring 
the components in -le (i.e., not in e). Then 

\E\-'Ge{x) G dfeix). 

Here, Gg is equal to zero on the components in -le. Using a sparse representation, we can calculate 
Ge{x), only knowing the values of x in the components indexed by e. Note that as a consequence 
of the uniform random sampling of e from E, we have 

E[Ge{Xe)] G df{x). 

In Algorithm [T| each processor samples an term e £ E uniformly at random, computes the 
gradient of fe at Xe, and then writes 

Xy Xjj — jb^Ge{x), for each u G e (3-1) 

Importantly, note that the processor modifies only the variables indexed by e, leaving all of the 
components in -le (i.e., not in e) alone. We assume that the stepsize 7 is a fixed constant. Even 
though the processors have no knowledge as to whether any of the other processors have modified x, 
we define Xj to be the state of the decision variable x after j updates have been performecj^ Since 
two processors can write to x at the same time, we need to be a bit careful with this definition, but 
we simply break ties at random. Note that xj is generally updated with a stale gradient, which 
is based on a value of x read many clock cycles earlier. We use ^^(j) to denote the value of the 
decision variable used to compute the gradient or subgradient that yields the state Xj. 

In what follows, we provide conditions under which this asynchronous, incremental gradient 
algorithm converges. Moreover, we show that if the hypergraph induced by / is isotropic and 
sparse, then this algorithm converges in nearly the same number of gradient steps as its serial 
counterpart. Since we are running in parallel and without locks, this means that we get a nearly 
linear speedup in terms of the number of processors. 

^Our notation overloads subscripts of x. For clarity throughout, subscripts and k refer to iteration counts, 
and V and e refer to components or subsets of components. 
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4 Fast Rates for Lock-Free Parallelism 



We now turn to our theoretical analysis of Hogwild! protocols. To make the analysis tractable, we 
assume that we update with the following "with replacement" procedure: each processor samples 
an edge e uniformly at random and computes a subgradient of fe at the current value of the decision 
variable. Then it chooses an v £ e uniformly at random and updates 

Xt, <— Xt, — 7|e|6^Ge(x) 



Note that the stepsize is a factor |e| larger than the step in (3.1). Also note that this update is 
completely equivalent to 



x^x--f\e\VjGe{x). (4.1) 

This notation will be more convenient for the subsequent analysis. 

This with replacement scheme assumes that a gradient is computed and then only one of its 
components is used to update the decision variable. Such a scheme is computationally wasteful as 
the rest of the components of the gradient carry information for decreasing the cost. Consequently, 
in practice and in our experiments, we perform a modification of this procedure. We partition out 
the edges without replacement to all of the processors at the beginning of each epoch. The processors 
then perform full updates of all of the components of each edge in their respective queues. However, 
we emphasize again that we do not implement any locking mechanisms on any of the variables. We 
do not analyze this "without replacement" procedure because no one has achieved tractable analyses 
for SGD in any without replacement sampling models. Indeed, to our knowledge, all analysis of 
without-replacement sampling yields rates that are comparable to a standard subgradient descent 



algorithm which takes steps along the full gradient of (2.1) (see, for example [22|). That is, these 
analyses suggest that without-replacement sampling should require a factor of more steps than 
with-replacement sampling. In practice, this worst case behavior is never observed. In fact, it is 
conventional wisdom in machine learning that without-replacement sampling in stochastic gradient 
descent actually outperforms the with-replacement variants on which all of the analysis is based. 

To state our theoretical results, we must describe several quantities that important in the 
analysis of our parallel stochastic gradient descent scheme. We follow the notation and assumptions 



of Nemirovski et al l23J. To simplify the analysis, we will assume that each /e in (2.1) is a convex 



function. We assume Lipschitz continuous differentiability of / with Lipschitz constant L: 

||V/(x')- V/(x)|| <L||x'-x||, Vx',xGX (4.2) 
We also assume / is strongly convex with modulus c. By this we mean that 

/(x) > /(x) + (x' -x)^V/(x) + -xf , for all x',x G X. (4.3) 

When / is strongly convex, there exists a unique minimizer x* and we denote = /(x^,). We 
additionally assume that there exists a constant M such that 

||G'e(xe)||2 < M almost surely for all x G X . (4.4) 

We assume throughout that 7c < 1. (Indeed, when 7c > 1, even the ordinary gradient descent 
algorithms will diverge.) 

Our main results are summarized by the following 
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Proposition 4.1 Suppose in Algorithm^ that the lag between when a gradient is computed and 
when it is used in step j — namely, j — k{j) — is always less than or equal to t, and 7 is defined 
to be 

— (A ^\ 

^ ~ 2LM'^n (1 + 6pr + 4t217A1/2) " 

for some e > and 1!} G (0, 1). Define Dq := \\xq — x*|p and let k be an integer satisfying 



ILlvPn (1 + Qtp + 6t217A1/2) log(LDo/e 
Then after k component updates of x, we have E[/(xfc) — /j,] < e. 



k > ^ ^ " ^ . (4.6) 



In the case that r = 0, this reduces to precisely the rate achieved by the serial SGD protocol. 
A similar rate is achieved if r = o(n^/^) as p and A are typically both o(l/n). In our setting, r 
is proportional to the number of processors, and hence as long as the number of processors is less 
n"^/'*, we get nearly the same recursion as in the linear rate. 



We prove Proposition 4.1 in two steps in the Appendix. First, we demonstrate that the sequence 



a 



•] ~ 2 



\¥.[\\xj — x^lp] satisfies a recursion of the form Oj < (1 — Cr^){aj+i — aoo) + Ooo for some 



constant that depends on many of the algorithm parameters but not on the state, and some 
constant Cr < c. This Cr is an "effective curvature" for the problem which is smaller that the true 
curvature c because of the errors introduced by our update rule. Using the fact that Cr^ < 1, we 
will show in Section [5] how to determine an upper bound on k for which < e/L. Proposition 4.1 



then follows because E[f{xk) — fix-k)] < Luk since the gradient of / is Lipschitz. A full proof is 
provided in the appendix. 



Note that up to the log(l/e) term in (4.6), our analysis nearly provides al/k rate of convergence 
for a constant stepsize SGD scheme, both in the serial and parallel cases. Moreover, note that our 
rate of convergence is fairly robust to error in the value of c; we pay linearly for our underestimate 
of the curvature of /. In contrast, Nemirovski et al demonstrate that when the stepsize is inversely 
proportional to the iteration counter, an overestimate of c can result in exponential slow-down |23]! 



We now turn to demonstrating that we can eliminate the log term from (4.6) by a slightly more 



complicated protocol where the stepsize is slowly decreased after a large number of iterations. 



5 Robust 1/k rates. 

Suppose we run Algorithm [T] for a fixed number of gradient updates K with stepsize 7 < 1/c. Then, 
we wait for the threads to coalesce, reduce 7 by a constant factor /3 S (0,1), and run for (3~^K 
iterations. In some sense, this piecewise constant stepsize protocol approximates a diminishing 
stepsize. The main difference with the following analysis from previous work is that our stepsizes 
are always less than 1/c in contrast to beginning with very large stepsizes. Always working with 
small stepsizes allows us to avoid the possible exponential slow-downs that occur with standard 
diminishing stepsize schemes. 

To be precise, suppose Ok is any sequence of real numbers satisfying 

Ofe+i < (1 - Cr7)(afc - 000(7)) + «oo(7) (5.1) 
where Ooo is some non-negative function of 7 satisfying 

(7) < 75 
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and Cr and B are constants. This recursion underlies many convergence proofs for SGD where 
denotes the distance to the optimal solution after k iterations. We will derive appropriate constants 
for Hogwild! in the Appendix. We will also discuss below what these constants are for standard 
stochastic gradient descent algorithms. 



Factoring out the dependence on 7 will be useful in what follows. Unwrapping (5.1) we have 

flfc < (1 - Cr7)''(ao - 000(7)) + 000(7) • 

Suppose we want this quantity to be less than e. It is sufficient that both terms are less than e/2. 
For the second term, this means that it is sufficient to set 



For the first term, we then need 
which holds if 



(1 - ^Cr)^ ao < e/2 



k > i^^^i^ . (5.3) 



By (5.2), we should pick 7 = for i? G (0, 1]. Combining this with (5.3) tells us that after 

2Slog(2ao/e) 



k > 



'decr 



iterations we will have < e. This right off the bat almost gives us a l//c rate, modulo the log(l/e) 
factor. 

To eliminate the log factor, we can implement a backoff scheme where we reduce the stepsize by 
a constant factor after several iterations. This backoff scheme will have two phases: the first phase 
will consist of converging to the ball about of squared radius less than ^ at an exponential 
rate. Then we will converge to by shrinking the stepsize. 

To calculate the number of iterates required to get inside a ball of squared radius ^ , suppose 
the initial stepsize is chosen as7 = ^(0<'i?<l). This choice of stepsize guarantees that 
the Ofc converge to Ooo- We use the parameter •& to demonstrate that we do not suffer much for 



underestimating the optimal stepsize (i.e., ■!? = 1) in our algorithms. Using (5.3) we find that 



A;>^-Mog(^) (5.4) 

iterations are sufficient to converge to this ball. Note that this is a linear rate of convergence. 

Now assume that oq < ^f^. Let's reduce the stepsize by a factor of (3 each epoch. This reduces 
the achieved e by a factor of /?. Thus, after log^(ao/e) epochs, we will be at accuracy e. The total 



number of iterations required is then the sum of terms with the form (5.3), with oq set to be the 
radius achieved by the previous epoch and e set to be /3 times this oq. Hence, for epoch number u, 
the initial distance is /3'^~^ao and the final radius is /3'^. Summing over all of the epochs (except for 
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the initial phase) gives 



log^(aoA) log^(aoA) 

^ log(2//3) ^ log(2//3) ^p ^ 



k=l 



k=l 



< 



< 



log(2//3) /3-i(Qo/g)-l 

i9 /3-1 - 1 

ap log(2//3) 
i9e 1-/3 
2i?log(2//3) 



(5.5) 



1-/3 



This expression is minimized by selecting a backoff parameter ~ 0.37. Also, note that when we 
reduce the stepsize by /3, we need to run for /3~^ more iterations. 

Combining (5.4) and (5.5), we estimate a total number of iterations equal to 

2B log(2//3) 



k > 1?"^ log 



apCr 



+ 



Cre 



1-/3 



are sufficient to guarantee that < e. 

Rearranging terms, the following two expressions give e in terms of all of the algorithm param- 
eters: 

2 1og(2//3) B 1 

1-/3 ' cr' k-^-nog{'^)' 



e < 



(5.6) 



5.1 Consequences for serial SGD 

Let us compare the results of this constant step-size protocol to one where the stepsize at iteration k 
is set to be 70/^ for some initial step size 7 for the standard (serial) incremental gradient algorithm 
applied to ( |2.1[ ). Nemirovski et al [23] show that the expected squared distance to the optimal 
solution, Ofc, satisfies 

ak+i < (1 - 2c7fc)afc + \^Im'^ . 



We can put this recursion in the form (5.1) by setting 7^ 
The authors of 



2c, B = and Coo 
with > 1 yields a bound 



23 



demonstrate that a large step size: 7^ 



7, Cr = 2c, if = and Coo = 
_©_ 

2cfc 



1 

Ofc < — max 



M2 



02 



46 



On the other hand, a constant step size protocol achieves 

^ ^ ^ log(2//3) ^ 

- 4(1 - /3) c2 ^ _ ^-1 1 



This bound is obtained by plugging the algorithm parameters into (5.6) and letting Dq = 2aQ. 



Note that both bounds have asymptotically the same dependence on iVf , c, and k. The expres- 



sion 



log(2//3) 
4(1-/3) 
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is minimized when (3 ~ 0.37 and is equal to 1.34. The expression 

9^ 
4G - 4 

is minimized when B = 2 and is equal to 1 at this minimum. So the leading constant is slightly 
worse in the constant stepsize protocol when all of the parameters are set optimally. However, if 
Dq > M'^/c^, the 1/k protocol has error proportional to Dq, but our constant stepsize protocol still 
has only a logarithmic dependence on the initial distance. Moreover, the constant stepsize scheme 
is much more robust to overestimates of the curvature parameter c. For the 1/k protocols, if one 
overestimates the curvature (corresponding to a small value of 0), one can get arbitrarily slow 
rates of convergence. An simple, one dimensional example in shows that Q = 0.2 can yield a 
convergence rate of k~^^^. In our scheme, = 0.2 simply increases the number of iterations by a 
factor of 5. 

The proposed fix in |23| for the sensitivity to curvature estimates results in asymptotically 
slower convergence rates of 1/ \/k. It is important to note that we need not settle for these slower 
rates and can still achieve robust convergence at 1/A; rates. 

5.2 Parallel Implementation of a Backoff Scheme 

The scheme described about results in a 1/k rate of convergence for Hogwild! with the only 
synchronization overhead occurring at the end of each "round" or "epoch" of iteration. When 
implementing a backoff scheme for Hogwild!, the processors have to agree on when to reduce the 
stepsize. One simple scheme for this is to run all of the processors for a fixed number of iterations, 
wait for all of the threads to complete, and then globally reduce the stepsize in a master thread. We 
note that one can eliminate the need for the threads to coalesce by sending out-of-band messages 
to the processors to signal when to reduce 7. This complicates the theoretical analysis as there 
may be times when different processors are running with different stepsizes, but in practice could 
allow one to avoid synchronization costs. We do not implement this scheme, and so do not analyze 
this idea further. 



6 Related Work 

Most schemes for parallelizing stochastic gradient descent are variants of ideas presented in the 
seminal text by Bertsekas and Tsitsiklis |4|. For instance, in this text, they describe using stale 
gradient updates computed across many computers in a master- worker setting and describe settings 
where different processors control access to particular components of the decision variable. They 
prove global convergence of these approaches, but do not provide rates of convergence (This is one 
way in which our work extends this prior research) . These authors also show that SGD convergence 
is robust to a variety of models of delay in computation and communication in [29] . 

We also note that constant stepsize protocols with backoff procedures are canonical in SGD 
practice, but perhaps not in theory. Some theoretical work which has at least demonstrated con- 
vergence of these protocols can be found in |20[|28| . These works do not establish the 1/k rates 
which we provided above. 

Recently, a variety of parallel schemes have been proposed in a variety of contexts. In MapRe- 
duce settings, Zinkevich et al proposed running many instances of stochastic gradient descent on 
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Hog wild! Round Robin 



type 


data 
set 


size p A 
(GB) 


time train test 
(s) error error 


time train test 
(s) error error 


SVM 


RCVl 


0.9 0.44 1.0 


9.5 0.297 0.339 


61.8 0.297 0.339 


MC 


Nctflix 
KDD 
Jumbo 


1.5 2.5e-3 2.3e-3 
3.9 3.0e-3 1.8e-3 
30 2.6e-7 1.4e-7 


301.0 0.754 0.928 
877.5 19.5 22.6 
9453.5 0.031 0.013 


2569.1 0.754 0.927 
7139.0 19.5 22.6 
N/A N/A N/A 


Cuts 


DBLife 
Abdomen 


3e-3 8.6e-3 4.3e-3 
18 9.2e-4 9.2e-4 


230.0 10.6 N/A 
1181.4 3.99 N/A 


413.5 10.5 N/A 
7467.25 3.99 N/A 



Figure 2: Comparison of wall clock time across of Hogwild! and RR. Each algorithm is run 
for 20 epochs and parallelized over 10 cores. 



different machines and averaging tlieir output fSO^. Though the authors claim this method can 
reduce both the variance of their estimate and the overall bias, we show in our experiments that 
for the sorts of problems we are concerned with, this method does not outperform a serial scheme. 
Schemes involving the averaging of gradients via a distributed protocol have also been proposed 



by several authors [10, 12 . While these methods do achieve linear speedups, they are difficult 
to implement efficiently on multicore machines as they require massive communication overhead. 
Distributed averaging of gradients requires message passing between the cores, and the cores need 
to synchronize frequently in order to compute reasonable gradient averages. 

The work most closely related to our own is a round-robin scheme proposed by Langford et 
al [16] . In this scheme, the processors are ordered and each update the decision variable in order. 
When the time required to lock memory for writing is dwarfed by the gradient computation time, 
this method results in a linear speedup, as the errors induced by the lag in the gradients are not 
too severe. However, we note that in many applications of interest in machine learning, gradient 
computation time is incredibly fast, and we now demonstrate that in a variety of applications, 
Hogwild! outperforms such a round-robin approach by an order of magnitude. 



7 Experiments 

We ran numerical experiments on a variety of machine learning tasks, and compared against a 



round-robin approach proposed in 16 and implemented in Vowpal Wabbit [15] . We refer to this 
approach as RR. To be as fair as possible to prior art, we hand coded RR to be nearly identical 
to the Hogwild! approach, with the only difference being the schedule for how the gradients are 
updated. One notable change in RR from the Vowpal Wabbit software release is that we optimized 
RR's locking and signaling mechanisms to use spinlocks and busy waits (there is no need for generic 
signaling to implement round robin). We verified that this optimization results in nearly an order 
of magnitude increase in wall clock time for all problems that we discuss. 

We also compare against a model which we call AIG which can be seen as a middle ground 
between RR and Hogwild!. AIG runs a protocol identical to Hogwild! except that it locks all 
of the variables in e in before and after the for loop on line 4 of Algorithm [T| Our experiments 
demonstrate that even this fine-grained locking induces undesirable slow-downs. 

All of the experiments were coded in C++ are run on an identical configuration: a dual Xeon 
X650 CPUs (6 cores each x 2 hyperthreading) machine with 24GB of RAM and a software RAID-0 
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Number of Splits Number of Splits Number of Splits 

Figure 3: Total CPU time versus number of threads for (a) RCVl, (b) Abdomen, and (c) 
DBLife. 



over 7 2TB Seagate Constellation 7200RPM disks. The kernel is Linux 2.6.18-128. We never use 
more than 2GB of memory. All training data is stored on a seven-disk raid 0. We implemented a 
custom file scanner to demonstrate the speed of reading data sets of disk into small shared memory. 
This allows us to read data from the raid at a rate of nearly IGB/s. 

All of the experiments use a constant stepsize 7 which is diminished by a factor /3 at the end of 
each pass over the training set. We run all experiments for 20 such passes, even though less epochs 
are often sufficient for convergence. We show results for the largest value of the learning rate 7 
which converges and we use /3 = 0.9 throughout. We note that the results look the same across a 
large range of (7, /3) pairs and that all three parallelization schemes achieve train and test errors 
within a few percent of one another. We present experiments on the classes of problems described 
in Section [2j 

Sparse SVM. We tested our sparse SVM implementation on the Reuters RCVl data set on 
the binary text classification task CCAT ^19j. There are 804,414 examples split into 23,149 training 
and 781,265 test examples, and there are 47,236 features. We swapped the training set and the 
test set for our experiments to demonstrate the scalability of the parallel multicore algorithms. 
In this example, p = 0.44 and A = 1.0 — large values that suggest a bad case for Hogwild!. 
Nevertheless, in Figure |3]^ a), we see that Hogwild! is able to achieve a factor of 3 speedup with 
while RR gets worse as more threads are added. Indeed, for fast gradients, RR is worse than a 
serial implementation. 

For this data set, we also implemented the approach in |30| which runs multiple SGD runs in 
parallel and averages their output. In Figure [sj^b), we display at the train error of the ensemble 
average across parallel threads at the end of each pass over the data. We note that the threads 
only communicate at the very end of the computation, but we want to demonstrate the effect of 
parallelization on train error. Each of the parallel threads touches every data example in each pass. 
Thus, the 10 thread run does lOx more gradient computations than the serial version. Here, the 
error is the same whether we run in serial or with ten instances. We conclude that on this problem, 
there is no advantage to running in parallel with this averaging scheme. 

Matrix Completion. We ran Hogwild! on three very large matrix completion problems. 
The Netflix Prize data set has 17,770 rows, 480,189 columns, and 100,198,805 revealed entries. The 
KDD Cup 2011 (task 2) data set has 624,961 rows, 1,000,990, columns and 252,800,275 revealed 
entries. We also synthesized a low-rank matrix with rank 10, le7 rows and columns, and 2e9 
revealed entries. We refer to this instance as "Jumbo." In this synthetic example, p and A are 
both around le-7. These values contrast sharply with the real data sets where p and A are both 
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Figure 4: Total CPU time versus number of threads for the matrix completion problems (a) 
Nctflix Prize, (b) KDD Cup 2011, and (c) the synthetic Jumbo experiment. 

on the order of le-3. 

Figure [sj^a) shows the speedups for these three data sets using Hogwild!. Note that the 
Jumbo and KDD examples do not fit in our allotted memory, but even when reading data off disk, 
Hogwild! attains a near linear speedup. The Jumbo problem takes just over two and a half 
hours to complete. Speedup graphs comparing Hogwild! to AIG and RR on the three matrix 
completion experiments are provided in Figure |4j Similar to the other experiments with quickly 
computable gradients, RR does not show any improvement over a serial approach. In fact, with 
10 threads, RR is 12% slower than serial on KDD Cup and 62% slower on Netflix. In fact, it is 
too slow to complete the Jumbo experiment in any reasonable amount of time, while the 10-way 
parallel Hogwild! implementation solves this problem in under three hours. 

Graph Cuts. Our first cut problem was a standard image segmentation by graph cuts problem 
popular in computer vision. We computed a two-way cut of the abdomen data set [T] . This data 
set consists of a volumetric scan of a human abdomen, and the goal is to segment the image into 
organs. The image has 512 x 512 x 551 voxels, and the associated graph is 6-connected with 
maximum capacity 10. Both p and A are equal to 9.2e-4 We see that Hogwild! speeds up the cut 
problem by more than a factor of 4 with 10 threads, while RR is twice as slow as the serial version. 

Our second graph cut problem sought a mulit-way cut to determine entity recognition in a 
large database of web data. We created a data set of clean entity lists from the DBLife website 
and of entity mentions from the DBLife Web Crawl [TT]. The data set consists of 18,167 entities 
and 180,110 mentions and similarities given by string similarity. In this problem each stochastic 
gradient step must compute a Euclidean projection onto a simplex of dimension 18,167. As a 
result, the individual stochastic gradient steps are quite slow. Nonetheless, the problem is still 
very sparse with /9=8.6e-3 and A=4.2e-3. Consequently, in Figure [3j we see the that Hogwild! 
achieves a ninefold speedup with 10 cores. Since the gradients are slow, RR is able to achieve a 
parallel speedup for this problem, however the speedup with ten processors is only by a factor of 5. 
That is, even in this case where the gradient computations are very slow, Hogwild! outperforms 
a round-robin scheme. 

What if the gradients are slow? As we saw with the DBLIFE data set, the RR method 
does get a nearly linear speedup when the gradient computation is slow. This raises the question 
whether RR ever outperforms Hogwild! for slow gradients. To answer this question, we ran the 
RCVl experiment again and introduced an artificial delay at the end of each gradient computation 
to simulate a slow gradient. In Figure [5][^c) , we plot the wall clock time required to solve the SVM 
problem as we vary the delay for both the RR and Hogwild! approaches. 
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Figure 5: (a) Speedup for the three matrix completion problems with Hogwild!. In all three 
cases, massive speedup is achieved via parallelism, (b) The training error at the end of each 
epoch of SVM training on RCVl for the averaging algorithm [SOj. (c) Speedup achieved over 
serial method for various levels of delays (measured in nanoseconds). 

Notice that Hogwild! achieves a greater decrease in computation time across the board. The 
speedups for both methods are the same when the delay is few milliseconds. That is, if a gradient 
takes longer than one millisecond to compute, RR is on par with Hogwild! (but not better). At 
this rate, one is only able to compute about a million stochastic gradients per hour, so the gradient 
computations must be very labor intensive in order for the RR method to be competitive. 

8 Conclusions 

Our proposed Hogwild! algorithm takes advantage of sparsity in machine learning problems 
to enable near linear speedups on a variety of applications. Empirically, our implementations 
outperform our theoretical analysis. For instance, p is quite large in the RCVl SVM problem, yet 
we still obtain significant speedups. Moreover, our algorithms allow parallel speedup even when 
the gradients are computationally intensive. 

Our Hogwild! schemes can be generalized to problems where some of the variables occur quite 
frequently as well. We could choose to not update certain variables that would be in particularly 
high contention. For instance, we might want to add a bias term to our Support Vector Machine, 
and we could still run a Hogwild! scheme, updating the bias only every thousand iterations or 
so. 

For future work, it would be of interest to enumerate structures that allow for parallel gradient 
computations with no collisions at all. That is, it may be possible to bias the SGD iterations to 
completely avoid memory contention between processors. For example, recent work proposed a 
biased ordering of the stochastic gradients in matrix completion problems that completely avoids 
memory contention between processors |25j . An investigation into how to generalize this approach to 
other structures and problems would enable even faster computation of machine learning problems. 
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A Analysis of Hogwild! 



It follows by rearrangement of 




that 




for all X ^ X. 




In particular, by setting x' = (the minimizer) we have 




for all X G X. 




We will make use of these identities frequently in what follows. 
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A.l Developing the recursion 

For 1 < w < n, let Vv denote the projection onto the vih. component of M". {Vv is a diagonal matrix 
with a 1 in the vih. diagonal position and zeros elsewhere.) Similarly, for e £ E, we let Ve denote 
the projection onto the components indexed by e — a diagonal matrix with ones in the positions 
indexed by e and zeros elsewhere. 

We start with the update formula (4.1). Recall that k{j) is the state of the decision variable's 
counter when the update to Xj was read. We have 

Xj+i = Xj - j\ej\Vv^Gej{x},(j)) . 

By subtracting x^, from both sides, taking norms, we have 

1 2^2 T 

Xj+i — Xi,\\2 = 2 ~ •^*ll2 ~ TI^jK^J ~ 'PvjGejiXf^f^j-^) 

+ ll\j\^\\Vv,Ge^iXkiJ))f 

^\\Xj -X^Wl --f\ej\{Xj - Xk(j)fVvPej{Xj) 
- - Xk{j)YVv^{Ge^{Xk{j)) - Ge^{Xj)) 

- 7\ej\{xk(j) - x^YVv^Ge^{xk(j)) + -l\j?'\\Vv^Ge^{xk(j))f 



Let Oj = — x^llg]. By taking expectations of both sides and using the bound (4.4), we obtain 

flj+l < Oj - l^[{Xj - Xk{j)fGe^{Xj)] - 7E[(xj - Xk(j)f{Gej{Xk(j)) - Ge^{Xj))] 

where we recall that Vt = maxgg^; |e|. Here, in several places, we used the useful identity: for any 
function C, ol xi, . . . ,Xj and any i < j, we have 

E[\ej\Vy^Gejixi)'^Cixi, Xj)] =E{E [\ej\Vy^Gej{xi)'^C{xi, ...,Xj) \ ei, . . . , ej,vi, . . . , Vj^i] } 

= E[Ge,ixifCixi,...,Xj)] . 

We will perform many similar calculations throughout, so, before proceeding, we denote 

e[j] := (ei,e2, . . . ,ej,fi,t>2, . . ■,Vi), 

to be the tuple of all edges and vertices selected in updates 1 through i. Note that xi depends 
on e[;_i] but not on Cj or Vj for any j > I. We next consider the three expectation terms in this 
expression. 

Let's first bound the third expectation term in (A. 3). Since is independent of Cj we have 

E [(Xfc(j) - Xi,)^Ge^{Xk(j))] 

= E{E [{x^j) - x^)'^Ge^.(xfc(j))|e[fc(j)_i]] } 
= ^{(.Xk(j) - x^)^E [Ge,{xk{j))\e[k{j)-i]] } 
= E [{xk{j) - x^fVf{xk(^j))] , 
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where V/ denotes an element of df. It follows from (A. 2) that 

The first expectation can be treated similarly: 

E[{xj - Xk(j)fGe^{xj)] = E {E[{xj - Xk(j)fGe^{xj) I ey_i]]} 

= E {(xj - Xk(j)f^Ge^{xj) I e[j_i]]} 
= E[(:e, -Xfc(,)fV/(x,-)] 

mixj) - f{xk(j))] + :^m\xj - Xk(j)\ 



(A.4) 



> 



(A.5) 



where the final inequality is from (A.l). Moreover, we can estimate the difference between f{xj] 
and /(xfc(j)) as 

i=k{j) 

= E T.^[fe{x^) - feixi+l)] 
i=k(j) e£E 

i-1 



i=k{j) eeE 



(A.6) 



Here we use the inequality 



1 



fe{xi) - fe{xi+l) < ■r^Ge(Xi) {Xi - Xj+i) 



\E\ 



GeiXi) GeAxi) 



which follows because fe is convex. By combining (A.5) and (A.6), we obtain 



E[ixj - Xfc(,))^Ge,,(x,)] > -7rpM^ + ^E[\\x, - ] . 



(A.7) 
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We turn now to the second expectation term in ( |A.3 ). We have 

E [{Xj - XfcQ-))^(Ge^(Xfc(j-)) - Ge^iXj))] 



E 



E 



E 



i=k(j) 



i-1 



^ 7|ei|Ge,(Xfc(i))^(Ge^.(Xfc(j)) -Ge^{Xj)) 
i=k(j) 



J-1 



^ -i\ei\Ge,{Xk(i))^ {Ge^{Xk^j)) - Ge^{xj)) 



> -E 



^ -f\ei\\\Ge,{x^i))\\ \\Ge^{Xk(j)) -Ge^{xj)\\ 



> -E 



i=fc{j) 



i-1 

2 



(Ai 



where p is defined by (2.6). Here, the third hne follows from om definition of the gradient update. 
The fourth line is tautological: only the edges where Cj and ej intersect nontrivially factor into the 
sum. The subsequent inequality is Cauchy-Schwarz, and the following line follows from (4.4). 
By substituting (A.4), (A.7), and (A. 8) into (A.3), we obtain the following bound: 



2^,2 



ij+i < aj - C7 (afc(j) + lE[\\xj - Xk{j)f]) + ^^-^ + 2Tp + iUpr) . 



(A.9) 



To complete the argument, we need to bound the remaining expectation in (A.9). We expand 
out the expression multiplied by cy in (A.9) to find 



aj-E 



aj - E 



^ {Xi+l - Xif{Xk(j) - x^) 
i=k{j) 



i-1 



^h\Ge,{Xkii)fVv,{Xk{j) 

i=k{j) 



Let er-,ji denote the set of all sampled edges and vertices except for Cj and Vi. Since Cj and Vi 
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are both independent of Xj^Q), we can proceed to bound 



E 



Yl 7|e^|G'e,(Xfc(i))'^T'^,,(Xfc(J) - X*) 
i=k{j) 



i-1 

^ jnM\\r,,{xki,) - x,)\\2 

i=k{j) 
i=A:(j) 
i=fc(j) 
i=k(j) 

<TjnMA^/''E[\\xkij) - x42] 
<T7l^MA^/2 {E[\\xj - x^lla] + T^nM) 



/2- 



where A is defined in (2.6). The first inequahty is Cauchy-Schwartz. The next inequahty is Jensen. 
The second to last inequahty fohows from our definition of xj, and the final inequality is Jensen 
again. 



Plugging the last two expressions into ( |A.9 ), we obtain 



where 



-1 < (1 - n)aj + 7^ (^/2cOMr af^ + ^M'^j^Q 
Q = n + 2Tp + A9.pT + 2t^9'^A^/^ . 



(A.IO) 



Here we use the fact that 07 < 1 to get a simplified form for Q. This recursion only involves 
constants involved with the structure of /, and the nonnegative sequence aj. To complete the 
analysis, we will perform a linearization to put this recursion in a more manageable form. 
To find the steady state, we must solve the equation 



(A.ll) 



This yields the fixed point 



2 \ y C7 / 

< ^ (nrA'/' + V^^r^A + q) " = C{r,p,A,n) 



(A.12) 



2c 



2c 
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Note that for p and A sufficiently small, C(t, p, A, il) ~ 1. 

Since the square root is concave, we can linearize ( |A.10[ ) about the fixed point a^o to yield 



aj+i < (1 - C7)(aj 



, 72 (^/2cl7MrAV2) 

Ooo) H „ , — [dj 



+ (1 - c7)ooo + 7^ (\/2cl7MrA^/2^ aj/^ + ^M^^g 



Here 



(1 - C7(l - 6)) {a, 
1 



+ Oc 



< 



n2r2A 



To summarize, we have shown that the sequence aj of squared distances satisfies 
flj+i < (1 - C7(l - (5(r, p. A, 9))){aj - Ooo) + aoo 



(A.13) 



with Ooo < CCt", P, A, ri)^^^^. In the case that r = (the serial case), C(t, p, A,f2) = $7 and 
5(r, p, A,ri) = 0. Note that if r is non-zero, but p and A are o(l/n) and o(l/-^/n) respectively, 
then as long as r = o(n^/^), C(r, p, A,r2) = 0(1). In our setting, r is proportional to the number 
of processors, and hence as long as the number of processors is less n^/^, we get nearly the same 
recursion as in the linear rate. 



In the next section, we show that (A.13) is sufficient to yield a l/fc convergence rate. Since we 



can run p times faster in our parallel setting, we get a linear speedup. 



A. 2 Proof of Proposition |4.1| Final Steps 

Since V/ is Lipschitz, we have 

f{x) < fix') + V/(x')^(x - x') + ^\\x- x'f for ah x, x G M" . 



Setting x' = gives f{x) — /(x*) < 



Hence, 



E[/(xfc) - fix,)] < Lak 

for all k. To ensure the left hand side is less than e, it suffices to guarantee that < e/L. 

To complete the proof of Proposition 4.1 , we use the results of Section |4j We wish to achieve a 
target accuracy of e/L. To apply (5.3), choose Ooo as in (A. 12) and the values 



Cr = c(l — 5(r, p. A, ri) 

B = CiT,p,A,n)—l 
2c 



By (A. 12), we have Ooo < 'jB. 



Choose 7 satisfying (4.5). With this choice, we automatically have 

ec 



7< 



< 



LMWA 1 + Jl + ^ 



2LB 
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because (1 + \/l + x)^ < 4 + 2x for all x > 0. Substituting this value of 7 into (5.3), we see that 

LMHog{LDo/e) C(r,p,A,Q) 



^ - ec2 1 - 5{t, p, a, J)) 

iterations suffice to achieve < e/L. Now observe that 



l-5(r,p,A,J]) 



1 + A/1 + 



1 



Q 



1 + A/1 + 



1 + 



n^T^A 

< SJ^V^A + 2Q 

< 2f7(l + 6rp + 6r2l7A^/2)_ 



Here,the second to last inequality follows because 



< 8 + 2x 



(A.14) 



for all X > 0. Plugging this bound into (A.14) completes the proof. 
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